Assignment 1-1: Web Scraping

Objective

Data scientists often need to crawl data from websites and turn the crawled data (HTML pages) into structured data (tables). Web scraping is therefore an essential skill for every data scientist. In this assignment, you will learn the following:

  • How to download HTML pages from a website?
  • How to extract relevant content from an HTML page?

Furthermore, you will gain a deeper understanding of the data science lifecycle.

Requirements:

  1. Please use pandas.DataFrame rather than spark.DataFrame to manipulate data.

  2. Please use BeautifulSoup rather than lxml to parse an HTML page and extract data from the page.

  3. Please follow the Python code style guide, PEP 8 (https://www.python.org/dev/peps/pep-0008/). If the TA finds your code hard to read, you will lose points. This requirement applies for the whole semester.

Preliminary

If this is your first time writing a web scraper, you will need some background on the topic. I found this to be a good resource: Tutorial: Web Scraping and BeautifulSoup.

Please let me know if you find a better resource. I'll share it with the other students.

Overview

Imagine you are a data scientist working at SFU. Your job is to extract insights from SFU data to answer questions.

In this assignment, you will do two tasks. Please recall the high-level data science lifecycle from Lecture 1. While doing this assignment, keep in mind what data you collected and what questions you tried to answer.

Task 1: SFU CS Faculty Members

Sometimes you don't know what questions to ask. No worries. Start collecting data first.

In Task 1, your job is to write a web scraper to extract the faculty information from this page: http://www.sfu.ca/computing/people/faculty.html.

(a) Crawl Web Page

A web page is essentially a file stored on a remote machine (called a web server). Please write code to download the HTML page and save it as a text file ("csfaculty.html").

In [1]:
import requests

URL = 'http://www.sfu.ca/computing/people/faculty.html'
filename = 'csfaculty.html'

response = requests.get(URL)

if response.status_code == 200:
  # Response has succeeded
  html_content = response.text

  # Save the text to a file
  with open(filename, 'w', encoding='utf-8') as file:
    file.write(html_content)
  print(f'File {filename} successfully saved.')
else:
  print('Get response failed. Error code:', response.status_code)
File csfaculty.html successfully saved.

(b) Extract Structured Data

Please write code to extract relevant content (name, rank, area, profile, homepage) from "csfaculty.html" and save them as a CSV file (like faculty_table.csv).
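As a minimal sketch of the extraction step, the snippet below parses a small hypothetical HTML fragment with BeautifulSoup and pulls a name, a rank, an area, and links out of it. The fragment is made up for illustration and is not the real SFU markup, whose tags and class names may differ; the helper name `parse_member` is also an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that loosely imitates one faculty entry.
SAMPLE_HTML = """
<div class="text">
  <h4>Ada Lovelace, Professor</h4>
  <p><b>Area:</b> Databases</p>
  <p><a href="/computing/people/faculty/ada.html">Profile</a>
     <a href="http://example.org/~ada">Homepage</a></p>
</div>
"""


def parse_member(html):
    """Return (name, rank, area, hrefs) for one faculty-like entry."""
    soup = BeautifulSoup(html, 'html.parser')
    block = soup.find('div', class_='text')

    # "Name, Rank" lives in the <h4>; the rank may be missing.
    parts = [p.strip() for p in block.find('h4').get_text().split(',', 1)]
    name = parts[0]
    rank = parts[1] if len(parts) > 1 else None

    # The first paragraph holds the area; links are collected from the block.
    paragraphs = block.find_all('p')
    area = paragraphs[0].get_text().replace('Area:', '').strip()
    hrefs = [a['href'] for a in block.find_all('a', href=True)]
    return name, rank, area, hrefs


print(parse_member(SAMPLE_HTML))
```

Working on a tiny fragment like this makes it easier to test the parsing logic before running it on the full page.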

In [2]:
from bs4 import BeautifulSoup as BS
import pandas as pd

with open(filename, 'r', encoding='utf-8') as file:
  html_content = file.read()

soup = BS(html_content, 'html.parser')

selector = '#page-content > section > div.main_content.parsys > div.parsys_column.cq-colctrl-lt0.people.faculty-list'

faculty_list = soup.select(selector)

data = []

for item in faculty_list:
  faculty_members = item.find_all(class_='textimage section')

  for member in faculty_members:
    # extract the content for each faculty member
    member_content = member.find('div', class_='text')

    # get the name and rank
    name_rank = member_content.find('h4').text.strip()
    name_rank = name_rank.split(',')
    name = name_rank[0].strip().title()
    if len(name_rank) > 1:
      rank = name_rank[1].split('\n')[0].strip().title()
    else:
      rank = None

    # get the area of specialty
    area_p_tag = member_content.find_all('p')
    area = area_p_tag[0].text.strip() if area_p_tag else 'p tag not found'
    area = area.replace('Area:', '').strip().title()

    # get both urls: the first link is the profile, the second (if any) the homepage
    link_tags = area_p_tag[1:2] or area_p_tag[:1]
    hrefs = [a.get('href') for p in link_tags for a in p.find_all('a') if a.get('href')]

    # profile links are usually site-relative; do not prefix absolute URLs twice
    base = 'http://www.sfu.ca'
    profile = None
    homepage = None
    if hrefs:
      profile = hrefs[0] if hrefs[0].startswith('http') else base + hrefs[0]
    if len(hrefs) > 1:
      homepage = profile if hrefs[1] == hrefs[0] else hrefs[1]

    member_data = [name, rank, area, profile, homepage]

    data.append(member_data)

headers = ['name', 'rank', 'area', 'profile', 'homepage']

# Create a pandas dataframe
df = pd.DataFrame(data, columns=headers)

# Save the dataframe to a file
outfilename = 'faculty_table.csv'
df.to_csv(outfilename)
print(f'Dataframe {outfilename} successfully saved.')
Dataframe faculty_table.csv successfully saved.
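The scraped `href` values can mix site-relative paths with absolute URLs, so plain string concatenation may produce addresses that start with `http://www.sfu.cahttp://www.sfu.ca`. One alternative, sketched here rather than taken from the solution above, is `urllib.parse.urljoin` from the standard library. The helper name `to_absolute` is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.sfu.ca'


def to_absolute(href, base=BASE_URL):
    """Resolve href against base; absolute URLs pass through unchanged."""
    return urljoin(base, href)


print(to_absolute('/computing/people/faculty/jiannanwang.html'))
print(to_absolute('http://www.sfu.ca/computing/people/faculty.html'))
```

Because `urljoin` leaves absolute URLs untouched, the same call can be applied to every link without checking its form first.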

(c) Interesting Finding

Note that you don't need to do anything for Task 1(c). The purpose of this part is to give you a sense of how to use exploratory data analysis (EDA) to come up with interesting questions about the data. EDA is an important topic in data science; you will learn about it soon in this course.

First, please install dataprep. Then, run the cell below. It shows a bar chart for every column. What interesting findings can you get from these visualizations?

In [3]:
!pip install dataprep
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires fastapi, which is not installed.
lida 0.0.10 requires kaleido, which is not installed.
lida 0.0.10 requires python-multipart, which is not installed.
lida 0.0.10 requires uvicorn, which is not installed.
bigframes 0.18.0 requires sqlalchemy<3.0dev,>=1.4, but you have sqlalchemy 1.3.24 which is incompatible.
ipython-sql 0.5.0 requires sqlalchemy>=2.0, but you have sqlalchemy 1.3.24 which is incompatible.
panel 1.3.6 requires bokeh<3.4.0,>=3.2.0, but you have bokeh 2.4.3 which is incompatible.
Successfully installed asttokens-2.4.1 bokeh-2.4.3 dataprep-0.4.5 executing-0.8.3 flask_cors-3.0.10 jedi-0.19.1 jinja2-3.0.3 jsonpath-ng-1.6.1 metaphone-0.6 ply-3.11 pure_eval-0.2.2 python-crfsuite-0.9.8 python-stdnum-1.19 rapidfuzz-2.15.2 regex-2021.11.10 sqlalchemy-1.3.24 varname-0.8.3
In [4]:
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_table.csv")
plot(df)

Out[4]:
DataPrep.EDA Report

Dataset Statistics
  • Number of Variables: 6
  • Number of Rows: 70
  • Missing Cells: 15
  • Missing Cells (%): 3.6%
  • Duplicate Rows: 0
  • Duplicate Rows (%): 0.0%
  • Total Size in Memory: 31.1 KB
  • Average Row Size in Memory: 454.3 B

Variable Types
  • Numerical: 1
  • Categorical: 5

Dataset Insights
  • Unnamed: 0 is uniformly distributed (Uniform)
  • rank has 2 (2.86%) missing values (Missing)
  • homepage has 13 (18.57%) missing values (Missing)
  • name has a high cardinality: 70 distinct values (High Cardinality)
  • area has a high cardinality: 58 distinct values (High Cardinality)
  • profile has a high cardinality: 70 distinct values (High Cardinality)
  • homepage has a high cardinality: 57 distinct values (High Cardinality)
  • name has all distinct values (Unique)
  • profile has all distinct values (Unique)
  • homepage has all distinct values (Unique)

Per-column plots (not shown): a histogram for Unnamed: 0 and bar charts for name, rank, area, profile, and homepage.

Below are some examples:

Finding 1: The number of Professors (26) is more than twice the number of Associate Professors (10).

Questions: Why did this happen? Is it common across CS schools in Canada? Will the gap grow or shrink in five years? What actions can be taken to widen or narrow the gap?

Finding 2: The homepage column has 22% missing values.

Questions: Why are there so many missing values? Is it because many faculty members do not have their own homepages, or because they have not added their homepages to the school page? What actions can be taken to prevent this from happening in the future?
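Findings like these can also be checked numerically. The sketch below uses a small made-up table (the values are placeholders, not the scraped data) to show how `value_counts` and `isna` give the counts and missing-value percentages that the charts summarize.

```python
import pandas as pd

# Placeholder rows standing in for faculty_table.csv.
toy = pd.DataFrame({
    'rank': ['Professor', 'Professor', 'Associate Professor', None],
    'homepage': ['http://example.org/a', None, None, 'http://example.org/d'],
})

# Number of faculty per rank (missing ranks are ignored by default).
rank_counts = toy['rank'].value_counts()

# Percentage of missing values in each column.
missing_pct = toy.isna().mean() * 100

print(rank_counts)
print(missing_pct.round(1))
```

Running the same two lines on the real table would let you confirm a finding before asking why it holds.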

Task 2: Age Follows Normal Distribution?

In this task, you start with a question and then figure out what data to collect.

The question that you are interested in is: Does SFU CS faculty age follow a normal distribution?

To estimate the age of a faculty member, you can collect the year in which s/he graduated from a university (gradyear) and then estimate age using the following equation:

$$\text{age} \approx 2021 + 23 - \text{gradyear}$$

For example, if someone graduated from a university in 1990, then the age is estimated as 2021 + 23 - 1990 = 54.
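The estimate is plain arithmetic and can be written as a one-line function. The name `estimate_age` and its parameter names are illustrative; the constants 2021 and 23 are taken from the equation.

```python
def estimate_age(gradyear, reference_year=2021, offset=23):
    """Estimate age as reference_year + offset - gradyear."""
    return reference_year + offset - gradyear


print(estimate_age(1990))  # 54
```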

(a) Crawl Web Page

You notice that faculty profile pages contain graduation information. For example, you can see that Dr. Jiannan Wang graduated from Harbin Institute of Technology in 2008 at http://www.sfu.ca/computing/people/faculty/jiannanwang.html.

Please write code to download the 68 profile pages and save each page as a text file.

In [6]:
# Iterate over df['profile'] and save each page as a text file
import os
from time import sleep

directory = 'profile_pages'
os.makedirs(directory, exist_ok=True)

# Retry each request several times in case it fails
max_retries = 10
retry_delay = 2

for link in df['profile']:
  # Catch instances where http://www.sfu.ca has been prepended twice to the str
  head = 'http://www.sfu.cahttp://www.sfu.ca'
  if link.startswith(head):
    link = link[len(head)//2:]

  filename = link.split('/')[-1].split('.')[0].title() + '.txt'
  file_path = os.path.join(directory, filename)
  if os.path.exists(file_path):
    continue  # We already have this page, so skip the request

  for attempt in range(max_retries):
    try:
      response = requests.get(link, timeout=30)
      response.raise_for_status()  # Raises an HTTPError if the response was an HTTP error

      with open(file_path, 'w', encoding='utf-8') as file:
        file.write(response.text)

      break  # Successful request, exit the loop

    except requests.RequestException as e:
      print(f'Attempt {attempt + 1} failed: {e}')
      sleep(retry_delay)
  else:
    print(f'All attempts failed for link {link}.')
Attempt 1 failed: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
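The retry loop relies on Python's `for ... else`: the `else` branch runs only when the loop finishes without hitting `break`. The sketch below isolates that pattern with a stand-in `fetch` callable, so it runs without network access. The names `with_retries` and `flaky` are hypothetical.

```python
import time


def with_retries(fetch, max_retries=3, retry_delay=0.0):
    """Call fetch() until it succeeds or max_retries attempts are used up."""
    for attempt in range(max_retries):
        try:
            result = fetch()
            break  # success: the else branch is skipped
        except Exception as exc:
            print(f'Attempt {attempt + 1} failed: {exc}')
            time.sleep(retry_delay)
    else:
        # Reached only if no attempt hit `break`.
        return None
    return result


calls = {'n': 0}


def flaky():
    """Fail on the first two calls, then succeed."""
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('connection reset')
    return 'page contents'


print(with_retries(flaky))
```

Passing the request in as a callable keeps the retry policy separate from the download code, which also makes it easy to test.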

(b) Extract Structured Data

Please write code to extract the earliest graduation year (e.g., 2008 for Dr. Jiannan Wang) from each profile page, and create a CSV file like faculty_grad_year.csv.

In [25]:
import re

headers = ['name', 'gradyear']

data = []

# iterate over each file in the directory
for filename in os.listdir(directory):

  # Skip anything that is not a saved profile page
  if not filename.endswith('.txt'):
    continue
  file_path = os.path.join(directory, filename)

  # Open the file
  with open(file_path, 'r', encoding='utf-8') as file:
    soup = BS(file, 'html.parser')

  # Get the name
  title = soup.find('title').text.strip().title()
  name = title.split(' - ')[0]

  # Collect every four-digit number in the Education section(s)
  years = []
  divs = soup.find_all('div', class_='text parbase section')
  for div in divs:
    education_h2 = div.find(lambda tag: tag.name == 'h2' and tag.text.strip() == 'Education')
    if education_h2 is not None:
      years += [int(y) for y in re.findall(r'(?<!\d)\d{4}(?!\d)', div.get_text(' '))]

  # Keep the earliest year, one row per faculty member
  if years:
    data.append([name, min(years)])

name_year_df = pd.DataFrame(data, columns=headers)
name_year_df.to_csv("faculty_grad_year.csv")
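The core of the year extraction can be isolated into a small function. The sketch below, with a hypothetical helper `earliest_year` and a made-up education string, finds every run of exactly four digits and returns the smallest one. It assumes that any four-digit number in the text is a year.

```python
import re

# Exactly four digits, not part of a longer digit run.
YEAR_PATTERN = re.compile(r'(?<!\d)\d{4}(?!\d)')


def earliest_year(text):
    """Return the smallest four-digit number in text, or None."""
    years = [int(match) for match in YEAR_PATTERN.findall(text)]
    return min(years) if years else None


sample = 'Ph.D., Example University, 2013; B.Sc., Example Institute, 2008'
print(earliest_year(sample))  # 2008
```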

(c) Interesting Finding

Similar to Task 1(c), you don't need to do anything here. Just look at the different visualizations of age and answer the question for yourself: Does SFU CS faculty age follow a normal distribution?

In [26]:
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_grad_year.csv")
df["age"] = 2021+23-df["gradyear"]

plot(df, "age")

Out[26]:
DataPrep.EDA Report

Overview
  • Approximate Distinct Count: 30
  • Approximate Unique (%): 47.6%
  • Missing: 0
  • Missing (%): 0.0%
  • Infinite: 0
  • Infinite (%): 0.0%
  • Memory Size: 1008
  • Mean: 44.0635
  • Minimum: 27
  • Maximum: 67
  • Zeros: 0
  • Zeros (%): 0.0%
  • Negatives: 0
  • Negatives (%): 0.0%

Quantile Statistics
  • Minimum: 27
  • 5-th Percentile: 28.1
  • Q1: 33
  • Median: 43
  • Q3: 53
  • 95-th Percentile: 62.9
  • Maximum: 67
  • Range: 40
  • IQR: 20

Descriptive Statistics
  • Mean: 44.0635
  • Standard Deviation: 11.7705
  • Variance: 138.5443
  • Sum: 2776
  • Skewness: 0.174
  • Kurtosis: -1.205
  • Coefficient of Variation: 0.2671

Insights
  • age is skewed right (γ1 = 0.174)

Plots (not shown): histogram, KDE plot, normal Q-Q plot, box plot.

Value Table (top 10 distinct values)

Value               Count   Frequency (%)
53                  5       7.9%
29                  4       6.3%
33                  4       6.3%
39                  4       6.3%
60                  4       6.3%
27                  3       4.8%
32                  3       4.8%
52                  3       4.8%
30                  2       3.2%
31                  2       3.2%
Other values (20)   29      46.0%
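Skewness and kurtosis give a quick numeric hint about normality: for a normal sample, both skewness and excess kurtosis are close to 0. The sketch below computes the population (biased) versions in plain Python on toy values. Whether DataPrep uses exactly these estimators is an assumption, and the function names are illustrative.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)


def skewness(values):
    """Population skewness: m3 / m2**1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


ages = [1, 2, 3, 4, 5]  # toy values, not the faculty data
print(skewness(ages), excess_kurtosis(ages))
```

For reference, a uniform distribution has an excess kurtosis of -1.2, so a clearly negative value points to a flatter shape than a normal curve.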

Submission

Complete the code in this notebook, and submit it to the CourSys activity Assignment 1.